Learning to integrate web taxonomies

Authors

  • Dell Zhang
  • Wee Sun Lee
Abstract

We investigate machine learning methods for automatically integrating objects from different taxonomies into a master taxonomy. This problem is not only currently pervasive on the Web, but is also important to the emerging Semantic Web. A straightforward approach to automating this process would be to build classifiers through machine learning and then use these classifiers to classify objects from the source taxonomies into categories of the master taxonomy. However, conventional machine learning algorithms totally ignore the availability of the source taxonomies. In fact, source and master taxonomies often have common categories under different names or other more complex semantic overlaps. We introduce two techniques that exploit the semantic overlap between the source and master taxonomies to build better classifiers for the master taxonomy. The first technique, Cluster Shrinkage, biases the learning algorithm against splitting source categories by making objects in the same category appear more similar to each other. The second technique, Co-Bootstrapping, tries to facilitate the exploitation of inter-taxonomy relationships by providing category indicator functions as additional features for the objects. Our experiments with real-world Web data show that these proposed add-on techniques can enhance various machine learning algorithms to achieve substantial improvements in performance for taxonomy integration.
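The two techniques named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract gives no formulas, so the linear interpolation toward a category centroid, the parameter `lam`, and the function names `cluster_shrinkage` and `add_category_indicators` are all assumptions made for illustration. The second function shows only the feature-augmentation idea behind Co-Bootstrapping (category indicators appended as extra features), not the full procedure.

```python
from collections import defaultdict


def cluster_shrinkage(vectors, source_labels, lam=0.5):
    """Pull each vector a fraction `lam` of the way toward the centroid
    of its source category, so same-category objects look more alike.

    Assumed form (not taken from the paper): x' = (1 - lam) * x + lam * centroid.
    """
    # Group the feature vectors by their source-taxonomy category.
    groups = defaultdict(list)
    for vec, label in zip(vectors, source_labels):
        groups[label].append(vec)

    # Per-category centroid: the mean of each feature over the members.
    centroids = {
        label: [sum(column) / len(members) for column in zip(*members)]
        for label, members in groups.items()
    }

    # Interpolate every vector toward its own category centroid.
    return [
        [(1 - lam) * x + lam * c for x, c in zip(vec, centroids[label])]
        for vec, label in zip(vectors, source_labels)
    ]


def add_category_indicators(vectors, source_labels):
    """Append one 0/1 indicator feature per source category to each vector.

    Illustrates only the 'category indicator functions as additional
    features' idea; the alternating training loop is not shown.
    """
    categories = sorted(set(source_labels))
    return [
        list(vec) + [1.0 if label == cat else 0.0 for cat in categories]
        for vec, label in zip(vectors, source_labels)
    ]
```

With `lam=0` the vectors are unchanged, and with `lam=1` every object collapses onto its source category centroid, so `lam` controls how strongly a downstream classifier is discouraged from splitting a source category.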

Similar resources

Learning to Integrate Web Taxonomies with Fine-Grained Relations: A Case Study Using Maximum Entropy Model

As web taxonomy integration is an emerging issue on the Internet, many research topics, such as personalization, web searches, and electronic markets, would benefit from further development of taxonomy integration techniques. The integration task is to transfer documents from a source web taxonomy to a target web taxonomy. In most current techniques, integration performance is enhanced by refer...

Web taxonomy integration with hierarchical shrinkage algorithm and fine-grained relations

We address the problem of integrating web taxonomies from different real Internet applications. Integrating web taxonomies means transferring instances from a source taxonomy to a target taxonomy. Unlike the conventional text categorization problem, in taxonomy integration, the source taxonomy contains extra information that can be used to improve the categorization. The major existing methods can be divided ...

Web Scale Taxonomy Cleansing

Large ontologies and taxonomies are automatically harvested from web-scale data. These taxonomies tend to be huge, noisy, and contain little context. As a result, cleansing and enriching those large-scale taxonomies becomes a great challenge. A natural way to enrich a taxonomy is to map the taxonomy to existing datasets that contain rich information. In this paper, we study the problem of match...

Pattern-based automatic taxonomy learning from the Web

The construction of taxonomies is considered the first step in structuring domain knowledge. Many methodologies have been developed in the past for building taxonomies from classical information repositories such as dictionaries, databases or domain text. However, in recent years, scientists have started to consider the Web as a valuable repository of knowledge. In this paper we present a n...

Semantic disambiguation of taxonomies

Polysemy is one of the most difficult problems when dealing with natural language resources. Consequently, automated ontology learning from textual sources (such as web resources) is hampered by the inherent ambiguity of human language. In order to tackle this problem, this paper presents an automatic and unsupervised method for disambiguating taxonomies (the key component of a final ontology)....

Journal title:
  • J. Web Sem.

Volume 2  Issue 

Pages  -

Publication date 2004